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Abstract: This paper presents a Markov chain Monte Carlo method to generate approx- 
imate posterior samples in retrospective multiple changepoint problems where the number of 
changes is not known in advance. The method uses conjugate models whereby the marginal 
likelihood for the data between consecutive changepoints is tractable. Inclusion of hyper- 
priors gives a near automatic algorithm providing a robust alternative to popular filtering 
recursions approaches in cases which may be sensitive to prior information. Three real ex- 
amples are used to demonstrate the proposed approach. 
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1 Introduction 

The range of applications of changepoint models is evident from the substantial volume of 
literature devoted to this problem in the econometrics, signal processing and bioinformatics 
literatures. A process generating data can often undergo changes over time such that one 
model will not be appropriate for all time periods. Here "time" refers to some natural 
sequential indexing of the data. Some examples are occurences of coal mining disasters 
during the 18^ and 19^ century (Raftery & Akman 1986), DNA or protein composition 
analysis over base number (Liu & Lawrence 1999) and winning streaks in sports (Yang 2004). 

Markov chain Monte Carlo (MCMC) techniques can be used to estimate models with a 
fixed number of changepoints. When the number of changepoints is unknown, inference is 
more challenging. Chib (1998) estimates a collection of changepoint models and compares 
these using Bayes factors estimated from the MCMC output. Green (1995) uses reversible 
jump MCMC (RJMCMC) to explore the number of changepoints in the coal mining disaster 
data. RJMCMC allows moves between models which satisfy detailed balance. 

The use of alternatives to MCMC has grown in this area in recent years. Fearnhead 
(2006) uses filtering recursions to derive the posterior distribution of changepoints. This 
can be done for both a known and unknown number of changepoints. An advantage of this 
approach is that one can draw independent samples from the posterior. MCMC can only 



do this approximately at best. Extension to online analysis of changepoint models is also 
possible (Fearnhead & Liu 2007). However methods based on filtering recursions rely on 
strong prior information in most cases. This paper aims to offer an efficient MCMC alter- 
native which can overcome strong reliance on prior assumptions as encountered in recursive 
computing approaches. The class of models considered is similar to Fearnhead (2006). For 
this reason it is possible that this could be used to give useful starting values for an analysis 
using filtering recursions. 

Qualitatively, the work in this paper is similar in some aspects to work by Lavielle & 
Lebarbier (2001) and Punskaya et al. (2002) in terms of the class of models considered. The 
sampling aspect of the approach bears similarities to the samplers of Lavielle & Lebarbier 
(2001) and Giron et al. (2007). This paper extends these works to a broader range of data 
models and proposes a more efficient way of sampling changepoints. An aim is also to 
highlight possible shortcomings of alternatives to MCMC and how these could be overcome 
by using simulation approaches to inform choices for recursive computing approaches. 

The remainder of the paper is organised as follows. In Section [2] the type of changepoint 
model under consideration is presented. Section [3] reviews the reversible jump approach to 
changepoint estimation and discusses how this can be simplified into a fixed dimensional 
sampling scheme. Section [4] gives the moves to sample from the simpler fixed dimensional 
posterior. Prior specification is discussed in Section |5j and Section [6] reviews the filtering 
recursion approach to generating samples of changepoints. Performance of the sampler is 
validated by analyzing the coal mining disasters data in Section [7j while Sections [8] and [9] 
compare qualitative aspects of the simulation based sampler approach and filtering recursions 
approach using two real data examples. A brief discussion concludes the article. 

2 Changepoint models 

Consider the data yi m = (y\, . . . ,y n ) which is time ordered. Here yi is observed before yj if 
% < j. Time in this context can refer to any natural ordering of the data as it is observed. A 
changepoint occurs at time t if y%, . . . , yt are generated differently to yt+i, . . . ,y n - Referring to 
y s:r (s < r) as a segment, this says that the segments y\± and yt+v.n are heterogeneous between 
but homogeneous within. Parametric changepoint models assign a different parameter for 
each segment to account for this heterogeneity. 

This paper considers multiple changepoints which will be denoted r±, . . . , T\-. These split 
the data into k + 1 segments. The likelihood for segment j has parameter 9j. Conditional on 
a segmentation, the data within each segment is assumed independent. It is also assumed 
that the regime parameters 9j are independent. The likelihood of the segmentation r = 
(ri, . . . ,r k ) is 

fc+l Tj 

n n 

j=i t= Tj _ 1 +i 

where for convenience r = 0, r fc+1 = n. Instead of using r, segmentations can be labelled 
with the binary latent vector z — (z±, . . . , z n ) with z t = 1 indicating a changepoint at time 
t and z n = 0. Independent priors are assumed for each member of 9 = (9\, . . . , 9k+i) with 
hyperparameter 7 and there is a prior for the changepoints with hyperparameter £, given by 
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7r(z|fc, £). The posterior may be written 

n(z,9\y, fe,C,7) oc 7r(z|A;,^)7r(0|A;,7)7r(y|0, A;) 

= ^lon^iT) n <vi\ G i) 

j=l i=Tj-_i+l 

where the dependence on the number of changepoints, k, is made explicit. A prior 7r(k) may 
be introduced so that the posterior of interest is the joint posterior of (k, z, 9), 

n(k, z,9\y, £, 7) oc ir(k)iT(z, 6\y, k, £, 7). (1) 

This is a hierarchical changepoint model similar to that used in Green (1995). 

3 Collapsing changepoint models 

It is possible to construct a MCMC scheme to sample the posterior of ([T| using RJM- 
CMC (Green 1995). The sampler will explore the product space support of this posterior: 

X = l[{k}x{Z k ,Q k \k} 

k 

where Z k ,Q k are respectively the sample spaces of z and 9 conditional on k changepoints. 
A switch in the number of changepoints in the model can be made by a RJ move switching 
between support subspaces. For the purposes of illustration a straightforward move of this 
type is now discussed. When proposing a switch from k to k + 1 changepoints one possibility 
is to generate a random variable m6R <! and form a bijection / : Q k x M. d — > O k +i where d 
is the dimension of a single 9j. This bijection gives the parameters for the proposed k + 1 
changepoint model as a function of those for the k changepoint model; 9' = (9[, . . . , 9' k+2 ) = 
f(6i, . . . , 6 k+ \,u). The proposed switch in model is then accepted with probability min(l, R) 
where 

8(6') 



R= ■K{k + l,z',6'\y,t, 1 )P{k + l,k) 1 



8(9, u) 



tt(M,%£,7) P(k,k + l)q(u\6) 
In the expression for R, P(-,-) denotes the proposal probability for transitions between 
different numbers of changepoints, and q(-\0) is the proposal density of u. The last term 
on the right is a Jacobian term for the bijection /. The reverse move in switching from 
k + 1 to k changepoints is accepted with probability min(l,i? _1 ). More elaborate moves 
between support subspaces are possible which propose changes to the model of more than 
one dimension or involve stochastic moves in both directions. 

The key questions in a changepoint analysis are usually; how many changepoints are there 
and where are the changepoints? The segment parameters 9 can be viewed as a nuisance 
parameter in this regard. Choosing conjugate priors for the 9j allows these to be collapsed 
in the model 

k+1 „ t-j 

x(k,z\y,£,"/) oc ir(k)ir(z\k,£) JJ / Ti(9j\^) JJ Tx(y i \9 j )d9 j 

j=l J i=Tj_i+l 

= Tr(k)Tr(z\k,0l[7r(y Tj _ 1+1 .. Tj \ 1 ), (2) 
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where 7r(y T ._ 1+ i :T . I7) is the marginal likelihood of the data segment y T ,-_i+i:T,- an d is assumed 
to be available in closed form due to the conjugacy. The support of this posterior is 

y = l[{k}x{z k \k} 

k 

and a switch from k to k + 1 changepoints does not require the design of a bijective function 
between support subspaces. The proposed switch in model is now accepted with Metropolis- 
Hastings probability min(l, A) where 

A _ n(k + l,z'\y^, 1 )P(k + l,k) (3) 

7r(My,£,7) p(k,k + iy u 

This idea of collapsing has been used previously in Punskaya et al. (2002) and Lavielle & 
Lebarbier (2001) for Gaussian data models. 

It can be seen that the first term on the right hand side of the acceptance ratio ^ is 
the Bayes factor for a model with k + 1 changepoints at positions z' versus a model with k 
changepoints at positions z, assuming all models are equally likely, a priori. Noting this, it 
becomes apparent that sampling k and z is equivalent to a model search over large model 
space. If there can be at most k changepoints, then the dimension of this space is J2k=o ("it ■ 
So searching for up to 5 changepoints in a dataset of length 200 corresponds to a dimension 
~ 2.5 x 10 9 . In the next section an MCMC scheme to search over these large model spaces, 
that is, sample from the posterior ([2j, is proposed. 



4 Sampling changepoints 

The MCMC scheme to generate samples of changepoints from the posterior ^ consists of 
three possible moves: add a changepoint; delete a changepoint; move a changepoint. Each 
sweep consists of the following; 

i. Choose to add or delete a changepoint with probabilities a k and dk = 1 — dk respectively. 
Clearly = do = 0. 

ii. Select a changepoint and propose to move it to a position in the range of its closest 
neighbouring changepoints. 

Add or delete a changepoint 

This move has been dicussed in Section [3] but more details are given here. Suppose there 
is currently k changepoints at postions z. Let z correspond to changepoints at Ti, . . . , t^. 
Randomly select one of the n — k — 1 points where there could be a changepoint i.e. a.t<n 
with z t = 0. Say this is currently in segment j given by y r _i+i:r - Relabel the proposed 
changepoints in z' as t[, . . . , r' k+1 with r- = t. Cancellation of marginal likelihood terms then 
implies that 

7T(k + l,z'|?/,£,7) _ 7T{k + 1) TT{z'\k + 1,Q ^j-x+l^j bM^ + l:^ hj 

7r(Mh/>£,7) n(k) n(z\k,£,) -K(y T ._ 1+1:T .\~/) 
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so calculation of A in (|3]) only requires at most three marginal likelihood values. Conversely, 
for the delete move, one of the k+1 changepoints in z' is chosen at random and the calculation 
of the acceptance probability involves 

tt^M^t) = n(k) n(z\k,£) ^(^,,+1^17) 

n(k + l,z'\y,£,j) ir(k + 1) n(z'\k + 1, ^(i/^+i^ItMv^+i^+Jt)' 

Finally, the proposal one step transition probabilities for the number of changepoints will 
be P(k,k + 1) = a k /{n — k — 1) and P{k + 1, k) = d k+1 /(k + 1), so that A (|3| can be 
computed. The acceptance probability for the add move is then min(l,A) and the delete 
move is accepted with probability min(l, A^ 1 ). 

Move a changepoint 

Gibbs update: Given the model assumption that the marginal likelihood for any segment is 
available in closed form, it is possible to update the position of any changepoint from its 
full conditional. Suppose Tj is being updated. Then the conditional probability that Tj = t, 
Tj-\ < t < Tj+i is proportional to 

n(z{ t) \k)Tr(y Tj _ l+1:t \j)Tr(y t+ i.. Tj+1 \j) 

where corresponds to changepoints n, . . . , Tj-%, t, Tj + i, . . . , Tk- The effort required for the 
Gibbs update is 0(r J+1 — r^-i) and so may be computationally expensive for large datasets 
with changepoints far apart, or datasets with many changepoints. In this situation a local 
random walk update may be preferred. 

Local random walk update: t is drawn uniformly from the integers max(rj— I, Tj_i+1), . . . , min(rj+ 
/, Tj + i — 1) where / specifies the locality of the proposed move. The move is accepted with 
probability min(l, B) where 

B = ^{yr^ 1 + l:t\iy{yt+l:r ] + 1 \l) 

ir(y Tj _ 1+ i lT . \j)ir(y Tj+1:Tj+1 \j) ' 

In the event that Tj — l < and t < Tj, B must be multiplied by (j-j — Tj-i + l)/ (t—Tj-i+l). 
Similar modifications are needed if t > Tj or Tj + I > Tj+±. 

Mixture of updates: A mixture of the two moves above should improve mixing and not be 
overly computationally expensive. For example, choose the Gibbs update with probability 
gp, = l/y/k (k > 1) and random walk with probability r k = 1 — g k . 

5 Prior specification 

There are many possible choices for ir(z\k,£). Yao (1984) considers a geometric distribution 
for the duration, d, of segments; d ~ Geometric (p). The prior used by Green (1995) has been 
adapted by Fearnhead (2006) for the discrete time context discussed here. The k changepoint 
locations are distributed as the even numbered order statistics in a sample of size 2k + 1 
from the integers 1, . . . , n — 1, drawn without replacement. 

The geometric prior relies on specification of £ = p. Ideally, one could simulate a segment 
specific Pj in a similar vein to Chib (1998). However this leads to more difficult jump 
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dynamics when adding or deleting a changepoint. The choice of p may impact the analysis. 
If too small, then it will assign very small probability to changepoints, meaning small changes 
cannot be detected with high power. If too large, then spurious changepoints are inferred. 
For these reasons, it desireable to introduce a hyperprior on p. For example, a Beta(ai, 02) 
prior with 1 < a\ < «2 (more weight less than 0.5), would be an ideal choice if there is 
enough prior information to choose ai,a^- Otherwise, a non-informative Beta(l, 1) prior 
would suffice. 

Segment parameters share a common hyperparameter 7 in Section [2] It is therefore 
possible to explore uncertainty in 7 also by introducing a hyperprior 7r(7). 

Sampling p and 7 can be easily incorporated into the MCMC scheme in Section [3j One 
sweep of the algorithm consists of: 

1. Sample the changepoints. 

2. Conditional on the changepoints sample p. 

3. Conditional on the changepoints sample 9. 

4. Conditional on 9 sample 7 and discard the 9 values. 

For the last step here, it will often be possible to sample 7 using a Gibbs step. However, if 
this is not possible, a simple random walk Metropolis-Hastings could be used. 

6 Analysis by filtering recursions 

It is useful to give a brief recap of the filtering recursions analysis of Fearnhead (2006) 
based on a point process prior for changepoint positions. Liu & Lawrence (1999), Barry 
& Hartigan (1992) have also used these types of methods for the analysis of changepoint 
problems. Define 

R-fit) = Pr{y Un I changepoint at t — 1,7}. 

It is possible to compute this quantity in a backward recursion. Defining i? 7 (n) = ^{ynl^), 
for t = n — 1 , . . . , 2 

n 

R^t) = 7l {yt-.s\l)R'y{s + l)g(s -t + l) + 7r(y t:n | 7 )(l - G(n - 1 + 1)) 

s=t 

and 

n-l 

Ryil) = ^(Vi-MR^s + l)g Q (s) + 7r(y 1:n | 7 )(l - G (n - 1)) 

s=l 

where the dependence of R 1 (t) on the hyperparameter 7 has been made explicit. Here g(-) 
gives the point process for the changepoint positions and G(-) the corresponding cumula- 
tive distribution function (the subscript on g and G in -R 7 (l) denotes the distribution of 
the first changepoint after 0). Yao (1984) takes this as geometric as do Barry & Hartigan 
(1992). Fearnhead (2006) suggests a negative binomial family in general for this process. 

After computing the recursions, a sample of size N of the changepoints can be efficiently 
simulated as follows: 
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1. Initialize all samples to have a changepoint at t = 0. 

2. For t = 0, . ..,n-2 

(a) Get rit, the number of samples for which the last changepoint was at time t. 

(b) If n t > compute the distribution of the next changepoint: 

Pr{r|j/i :n , *} = 7r(^ +1:r | 7 ) J R 7 (r + l)(/(r - tV-R^t + 1) 

(c) Sample n t times from Pr{r|yi :n , t} and update the nt samples that have the last 
changepoint at t. 

There are two strengths of this approach. The first is that the samples of changepoints 
will be independent draws from the posterior distribution. The second is the fast sampling 
algorithm which avoids computing the distribution of the next changepoint for each possible 
time. The main weakness of this approach is that the generated samples are dependent on 
a fixed value of the hyperparameters 7. Updating 7 using a hyperprior to correctly explore 
uncertainty in the value would involve recomputing the recursions i? 7 (i) for each new value 
of 7, a computation which is quadratic in n. This would lead to an infeasible computational 
overhead for any reasonably large sample from the posterior. 



7 Poisson data: coal mining disasters 

The sampler of Section [1] was applied to the coal-mining data of Jarrett (1979). This data 
records the dates of serious coal-mining disasters between 1851 and 1962. Disasters are 
assumed to arise from a Poisson process whose intensity is the height of a step function with 
an unknown number of steps. For comparison with Fearnhead (2006), time is discretized in 
weeks and the intensities are taken to be Gamma(l, 200/7), a priori. Details on the model 
marginal likelihood calculations are given in the Appendix. Conditional on k changepoints 
the prior on their positions was taken to be the same as the distribution of the even numbered 
order statistics of a sample of size 2k + 1 drawn without replacement from {1, . . . ,n — 1} 
(Fearnhead 2006), 

(n — 1 \ 1 k 
2k +1) n^+ 1- ^" 1 )' 

where for convenience, tq = and Tk+i = n. The algorithm was run for 500,000 sweeps after 

tli 

10,000 burn in. Every 50 sample was taken to reduce dependency in the MCMC iterates. 
This took 10 seconds on a 2.5GHz processor. Figure [I] (a) shows that the posterior number 
of changepoints is almost identical to that obtained from long runs of a RJMCMC sampler 
and methods based on recursions (see Fearnhead (2006), Figure l.(a)). 
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Figure 1: Coal mining disasters: (a) Posterior number of changepoints (b) Plot of the 
autocorrelation function of the number of changepoints 




Figure 2: Streakiness dataset: Cumulative counts of Tiger Woods' tournament wins 
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8 Streakiness in sports 



A sportsperson is considered "streaky" if instead of having a constant success rate over 
time, they have periods of high success rate. Such data will generally be a binary sequence 
with a "0" denoting a loss and a "1" denoting a win. The data concerning Tiger Woods' 
championship wins from September 1996- June 2001 was given and analyzed by Yang (2004), 
and are reanalyzed using the sampler of Section |4} The cumulative counts are shown in 
Figure [2] (a). Following Yang (2004) the data as is assumed to arise as a sequence of Bernoulli 
trials, with a possible changing probability of success. The data is ordered by subsequent 
tournament, and if a changepoint occurs, it is assumed to do so at some tournament. Let 
Sj = Y^i=r - 1 +i Vii ^ ne number of sucesses in a segment. Then assuming a Beta(a, (3) prior 
for the probability of success in any segment, 

_ T{a + (3} Y{ Sj + a}T{Tj - t 3 _ x - Sj + 0} 
n y Tj _ 1+hT .\a,p) T{a]rm r{T ._ Tj _ 1 + a + f3} 

Details of this calculation are given in the Appendix. The parameters a and (3 were both 
set equal to 1. The distribution between changepoints was taken to be Geometric(p). The 
specification of p may have an effect on the outcome of the analysis. It is thus desirable 
to investigate uncertainty in its value. This is done in two ways. Firstly, a simulation 
study using the sampler of Section [4] is carried out, where there is a hyperprior placed on 
p. Secondly, outputs of analyses using filtering recursions (Fearnhead 2006) for a range of 
values p are compared. 

For the MCMC simulation study using the sampler proposed earlier, the hyperparameter 
given to p was uniform on [0, 1]. After each update of the changepoints the value of p was 
updated by drawing from its full conditional distribution which is Beta(A; + l,n — k). A 
discrete uniform prior on [0, . . . , 10] was taken for the number of changepoints. This gives 
no discriminating prior weight on a particular number of changepoints. The sampler was 
run 100 times each for 100,000 burn in iterations and a subsequent 1,000,000 iterations. To 

th 

reduce dependency in the sample, only every 100 LIi sample was stored. Each run took about 
1.5 min on a 2.5GHz processor. Changepoints were updated using the mixture of moves 
discussed in Section [IJ Figure [2] (b) shows the output from one of these runs, with the 
posterior probability of a changepoint at any tournament indicated by the dashed line and 
a scaled counts curve overlain. Figure [3] (a) shows posterior probability of the number of 
changepoints over the 100 runs of the sampler. It can be seen that the sampler performs 
consistently, giving similar results over the 100 runs. Figure [3] (b) shows a histogram for 
the sampled values of p from the last run. Posterior support for p is highest over the range 
[0,0.1]. 

For the filtering recursions analysis (Fearnhead 2006), the recursions of Section [6] were 
computed for p e [0, 0.1] following the analysis above. A sample of size 100,000 changepoints 
was generated and the posterior of the number of changepoints was computed for each 
value of p. The modal number of changepoints was recorded from this for each value of 
p and is shown in Figure |4} It is clear that the number of changepoints inferred in the 
filtering recursions analysis is very sensitive to the value of p for this data. It is questionable 
whether such an analysis would be useful for a practitioner since it is unclear how one could 
objectively choose p in this situation. Certainly an exploratory analysis would be necessary 
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Figure 3: Streakiness dataset: (a) Boxplots of posterior probability for a given number of 
changepoints for 100 independent runs of the sampler, (b) Histogram of marginal draws of p 
from one run in the MCMC sampler simulation study. 



before choosing the value of p to compute the filtering recursions. One suggestion is to use the 
sampler proposed here for an exploratory analysis of the posterior allowing for uncertainty in 
the specification of p. The MCMC sampler simulation study suggests that two changepoints 
is most likely although there is relatively strong support for up to five changepoints. In 
this case, specification of one value of p to generate samples of changpoints will not fully 
explore uncertainty in the posterior. As before, the output of the MCMC sampler shown 
from Figure [2] (b) shows that one change is clearly identified, but that there is considerable 
uncertainty in the other positions, hence the support for up to five changepoints. 



9 Gaussian changepoint models 

Gaussian changepoint models are widely used and studied. Models can include those with 
changing mean and/or variance across segments. The model assumed for the purposes of 
the example here is piecewise constant, where data in any segment is Gaussian distributed. 
Segments share a common error variance. Data point yi in segment j is assumed to arise in- 
dependently from a N(/ij, a 2 ) distribution. The segment means fij are assumed to arise from 
a Gaussian distribution with mean /i and variance u 2 a 2 , a priori. Denote 7 = (cx 2 ,/io, v 2 )- 
Segment length is assumed to have a geometric distribution with parameter p. This gives 
the log posterior (up to a constant) as 

log7r(/c, z\y,p, 7) = — {k + 1) log v — [n + k + 1) logo" + {n — k — 1) log(l — p) + k logp 
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Figure 4: Streakiness data: Modal number of changepoints from a filtering recursions analysis 
over a range of values of p. 

where ssj = Y^L^^+iVi an d Sj = ^[i Tj _ 1+ i Hi- Details of this calculation are given in the 
Appendix. 



Application to Well-log data 

The Well-log data (O Ruanaidh & Fitzgerald (1996)) records measurements of nuclear- 
magnetic response of underground rocks obtained by lowering a probe into a bore-hole. The 
probe records the response at regular points in time. As well as Fearnhead (2006) this data 
was also analyzed in Fearnhead & Clifford (2003). The data consists of 4050 measurements, 
some of which are outliers and were removed before analysis. The data are shown in Figure [5] 
The purpose of this example is to demonstrate how results from an analysis with filtering 
recursions may be sensitive to the choice of hyperparameters 7 and how a short run of the 
sampler could possibly provide good starting values. It is possible to fit a more elaborate 
state space model to the Well-log data, however, this is not considered here. 

Fearnhead (2006) chose the values p = 0.013, a = 2, 330, = 4.3, fi = 115,000 when 
analyzing the Well-log data in the section on inclusion of hyperpriors. Two simple exper- 
iments were performed here to investigate sensitivity of the posterior distribution to prior 
specification. One of p (Experiment 1) or o (Experiment 2) was varied over a grid on a small 
range keeping all other hyperparameter values fixed (details in Table [I]). The recursions 
of Section [6] were computed for each value on the grid and a sample of size 100,000 was 
generated from the posterior of the changepoints. The empirical posterior distribution of 
the number of changepoints was computed for each of these samples and the modal number 
of changepoints recorded. The results are summarized in Figure |6j It can be seen that the 
modal value of the posterior number of changepoints is sensitive to the values of both p and 
a. Thus choosing these values, a priori, places the posterior mass ir(k, z\y,p, 7) in the area 
determined by p and a and may not correctly represent the true posterior over all p, a. 
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Recursion Sensitivity 


Fixed 


Varied 


Experiment 1 
Experiment 2 


a = 2, 330, v = 4.3, /i = 115, 000 
p = 0.013, v = 4.3, /i = 115, 000 


p G [0.005,0.03] 
a e [2250,2750] 



Table 1: Well- log data: Experiments to investigate sensitivity of results of filtering recursions 
to prior specification 

For the Well-log data it would seem most sensible to carry out an analysis with inclusion 
of hyperpriors on p, o and /^o using the scheme outlined in Section [5j The hyperpriors 
used are 7r(p) oc 1, 7r(/i ) oc 1, 7r(u) oc 1/u, 7r(a) oc \jo. The bottom of Figure [5] shows the 
posterior probability of a change output from an algorithm run for 10,000 burn-in and 100,000 
subsquent iterations using a random walk update for changepoint positions. Ergodic mean 
estimators of the hyperparameters were a = 2360, p = 0.014, v = 3.99, /to = 113771.0. This 
took about 10 sec on a 2.5GHz processor with very diffuse starting values. This Gaussian 
model infers many changepoints as it picks up small changes in the mean and thus performs 
well for this data. 

A long run of the sampler was implemented so as to obtain a near independent sam- 
ple (1.8 x 10 7 iterations taking every 1,800^ sample; estimated integrated autocorrelation 
time of the number of changepoints ~ 1) of size 10,000 from the posterior distribution of 
changepoints and hyperparameters. This was compared with results from the independence 
proposal suggested by Fearnhead (2006). In the independence proposal MCMC scheme sug- 
gested in Fearnhead (2006), a sample of changepoints is generated using filtering recursions 
conditional on p = 0.013, a = 2,330,/io = 115, 000, v = 4.3. This sample is then used for 
an independence proposal and hyperparmeters are updated in the same way as done here. 
Figure [7] shows kernel density estimates constructed from samples of the hyperparameters for 
the sampler (dashed line) and independence proposal (solid line). It can be seen that there 
is a slight discrepancy in that the independence proposal leads to more peaked densities. 

In our implementation an independence proposal based on a sample of size 10,000 was 
used. This updating scheme for hyperparameters and changepoints was then run for 50,000 
iterations. Although the acceptance rate for moving between different changepoint con- 
figurations was high, the independence proposal distribution was highly degenerate. Only 
ten unique changepoint configurations were sampled in the 50,000 iterations of the MCMC 
scheme. For other datasets where less information is available to choose the hyperparameters 
to generate the independence proposal, it is possible that this could lead to highly biased 
sampling from the hyperpriors. 

In the sense of hyperprior incorporation and full exploration of the posterior distribution 
the MCMC sampler proposed performs better than the independence proposal. However, 
generating independent samples may be more costly in large datasets with many change- 
points. Nonetheless, it is clear that the inclusion of hyperpriors circumvents the sensitivity 
of posterior distribution of the changepoints to specification of the hyperparameters. This 
is a main advantage of the approach proposed here and makes the detection of changepoints 
more automatic. 
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Figure 5: Top: Well-log data. Bottom: Posterior probability of a changepoint in any position 
from 100,000 samples using the sampler with hyperpriors. 




Figure 6: Well-log data: Modal number of changepoints for a filtering recursions analysis 
of the Well-log data for Experiment 1 and Experiment 2. Experiment 1 varies p (left) and 
Experiment 2 varies a (right) 
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Figure 7: Well-log data: Comparison of long run of sampler to MCMC scheme with inde- 
pendent proposals from filtering recursions. Dashed lines give the density from the MCMC 
sampler output and solid lines give the density output from analysis using the independent 
proposal scheme suggested in Fearnhead (2006). 
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10 Discussion 



This paper has presented an MCMC method to perform retrospective inference for change- 
point model which are collapsable. The multiple changepoint problem is rephrased as a 
stochastic model search over a large models space, with the Bayes factors for competing 
models appearing in the acceptance probabilities for the MCMC sampling scheme. 

The performance of the sampler was verified for the benchmark coal mining disasters data. 
Application of the sampler to a streakiness dataset from sports revealed that posteriors for 
the number of changepoints can be diffuse. It was demonstrated that prior specification 
on the duration of segments plays a crucial role in the analysis of the models considered. 
Incorporation of hyperpriors to account for this revealed features of the posterior that would 
be missed by a popular filtering recursions analysis for changepoints. Application to the Well- 
log data further highlighted sensitivity of analysis by filtering recursions to prior specification. 
It was shown that output from a short run of our sampler can be used to give good values 
of the hyperparameters for this prior specification. 

In conclusion, the sampling scheme presented is shown to work well and can provide 
further insight and account for prior uncertainty in some difficult situations. It can be used 
as a useful exploratory tool or for a full analysis. Computer code implementing the sampler 
written in C may be downloaded from www.ucd.ie/statdept/jwyse. 



Appendix 

Calculations for the coal-mining example 

Given a segment y s: t, each yi ~ iid Poisson(yu). Here \x is the height of the step function 
that gives the intensity of the process between times s and t. Assume the prior for fi is 
Gamma(p, A) where 7 = (p, A). The marginal likelihood for the segment is then 

TT(y S :t\l) = / =^// _1 exp{-A//}]| — j-exp{-//}d// 

JO 1 \PS i- a Vi- 



r°° 1 

/ —n s " t+p - 1 exp{-{t-s + X + l)n}dfji 

JO r S :t 



where F s:t = Y\i= s y^ an< ^ ^s-.t — ^2i=sVi- Completing the integral of the Gamma density 
gives 

/ , x A" 1 T{S s:t + p} 



T{p}F a:t (t - s + A + 



Calculations for the streakiness example 

Within a segment y s:t , yi ~ iid Bernoulli(0). Taking a Beta(a, j3) prior on 0, the marginal 
likelihood is obtained from 
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where 7 = (a, (3). This reduces to 

r> + /3} /- 1 



iSs-.t+a-li 



1-0) 



t-s-S 3:t +/3 



d(/>. 



r{«}r{/3} y 

where = Y^i= s Vi- Completing the Beta integral gives 

r> + g} r{s s:t + a}r{t - g + 1 - s s .. t + g} 
n^tw r{ a }r{/3} r{t-s + i + « + /?} 

Calculations for Gaussian changepoint model 

The model for all the data may be written hierarchically as 

ir(k,z,9\y,p,7) oc 7c(z\k,p)7c(9\k, z, a, fi )ir(y\k, z, 0) 

fc+i , f , \ 
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2a 2 
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JJexp 



2a 2 



Completing the square on /ij and then performing integration of /ij over (—00, 00) gives the 
required posterior. 
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